

## SRM Institute of Science and Technology College of Engineering and Technology School of Computing

Mode of Exam

OFFLINE

## DEPARTMENT OF COMPUTING TECHNOLOGIES

SRM Nagar, Kattankulathur – 603203, Chengalpattu District, Tamilnadu

Academic Year: 2023-2024 (Even) SET B-Answer key

Test: CLAT-1

Course Code & Title: 18CSE419T & GPU Programming

Year & Sem: III & V

Date: 14.02.24

Duration: 50 minutes

Max. Marks: 25

## **Course Articulation Matrix:**

| S.N | o. Co | urse Outcome | PO1 | PO2 | PO3 | PO4 | PO5 | PO6 | PO7 | PO8 | PO9 | PO10 | PO11 | PO12 |
|-----|-------|--------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|------|
| 1   | CO    | )1           | 3   | 2   |     |     |     |     |     |     |     |      |      |      |

|                              |                                                                                      |     |    | l  | I  |        |  |  |  |  |
|------------------------------|--------------------------------------------------------------------------------------|-----|----|----|----|--------|--|--|--|--|
| Part – A<br>( 5 x1= 5 Marks) |                                                                                      |     |    |    |    |        |  |  |  |  |
| Instructions: Answer all     |                                                                                      |     |    |    |    |        |  |  |  |  |
| Q.                           | Question                                                                             | Ma  | BL | CO | PO | PI     |  |  |  |  |
| No                           |                                                                                      | rks |    |    |    | Code   |  |  |  |  |
| 1                            | The computer architecture aimed at reducing the time of execution of instructions is | 1   | 1  | 1  | 1  | 1.3.1* |  |  |  |  |
|                              | a)RISC                                                                               |     |    |    |    |        |  |  |  |  |
|                              | b)CISC                                                                               |     |    |    |    |        |  |  |  |  |
|                              | c)ISA                                                                                |     |    |    |    |        |  |  |  |  |
|                              | d)IANA                                                                               |     |    |    |    |        |  |  |  |  |
|                              |                                                                                      |     |    |    |    |        |  |  |  |  |
| 2                            | Ans:A                                                                                | 1   | 1  | 1  | 1  | 1.3.2  |  |  |  |  |
|                              | The n × n matrix is partitioned among n                                              | 1   | 1  | 1  | 1  | 1.3.2  |  |  |  |  |
|                              | processors, with each processor storing                                              |     |    |    |    |        |  |  |  |  |
|                              | complete of the matrix.                                                              |     |    |    |    |        |  |  |  |  |
|                              | a)row                                                                                |     |    |    |    |        |  |  |  |  |
|                              | b)column                                                                             |     |    |    |    |        |  |  |  |  |
|                              | c)both                                                                               |     |    |    |    |        |  |  |  |  |
|                              | d)depend on processor                                                                |     |    |    |    |        |  |  |  |  |
|                              | a)depend on processor                                                                |     |    |    |    |        |  |  |  |  |
|                              | Ans:A                                                                                |     |    |    |    |        |  |  |  |  |
| 3                            | In coarse-grained parallelism, a program is split                                    | 1   | 1  | 1  | 1  | 1.3.1  |  |  |  |  |
|                              | into task and                                                                        |     |    |    |    |        |  |  |  |  |
|                              | Size                                                                                 |     |    |    |    |        |  |  |  |  |
|                              | a)Large tasks , Smaller Size                                                         |     |    |    |    |        |  |  |  |  |
|                              | b)Small Tasks , Larger Size                                                          |     |    |    |    |        |  |  |  |  |
|                              | c)Small Tasks , Smaller Size                                                         |     |    |    |    |        |  |  |  |  |
|                              | d)Equal task, Equal Size                                                             |     |    |    |    |        |  |  |  |  |
|                              | Ans:A                                                                                |     |    |    |    |        |  |  |  |  |
| 4                            | Calling a kernel is typically referred to as _                                       | 1   | 1  | 1  | 1  | 1.3.3  |  |  |  |  |
|                              |                                                                                      |     |    |    |    |        |  |  |  |  |

| a)kernel thread b)kernel initialization c)kernel termination d)kernel invocation  Ans:D  5. A code, known as GRID, which runs on GPU consisting of a set of a)32 Thread b)32 Block c)Unit Block d)Thread Block Ans:D  Part - B (2 x4 = 8 Marks) |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| c)kernel termination d)kernel invocation  Ans:D  5. A code, known as GRID, which runs on GPU consisting of a set of a)32 Thread b)32 Block c)Unit Block d)Thread Block Part – B                                                                 |
| d)kernel invocation  Ans:D  5. A code, known as GRID, which runs on GPU consisting of a set of  a)32 Thread b)32 Block c)Unit Block d)Thread Block Part - B                                                                                     |
| Ans:D  5. A code, known as GRID, which runs on GPU consisting of a set of  a)32 Thread b)32 Block c)Unit Block d)Thread Block Ans:D  Part – B                                                                                                   |
| 5. A code, known as GRID, which runs on GPU consisting of a set of  a)32 Thread b)32 Block c)Unit Block d)Thread Block Part – B                                                                                                                 |
| a)32 Thread b)32 Block c)Unit Block d)Thread Block  Ans:D                                                                                                                                                                                       |
| b)32 Block c)Unit Block d)Thread Block  Ans:D  Part – B                                                                                                                                                                                         |
| b)32 Block e)Unit Block d)Thread Block  Ans:D  Part – B                                                                                                                                                                                         |
| d)Thread Block  Ans:D  Part – B                                                                                                                                                                                                                 |
| Ans:D Part – B                                                                                                                                                                                                                                  |
| Part – B                                                                                                                                                                                                                                        |
|                                                                                                                                                                                                                                                 |
| (2  vA - 9  Mayba)                                                                                                                                                                                                                              |
|                                                                                                                                                                                                                                                 |
| Instructions: Answer any 2  6 Discuss the advantages and disadvantages of multicore trajectory and 4 1 1 1 1.3.1                                                                                                                                |
| many thread trajectory.                                                                                                                                                                                                                         |
|                                                                                                                                                                                                                                                 |
|                                                                                                                                                                                                                                                 |
| The multicore trajectory seeks to maintain the execution                                                                                                                                                                                        |
| speed of sequential programs while moving into multiple                                                                                                                                                                                         |
| cores. The multicores began with two core processors                                                                                                                                                                                            |
| with the number of cores increasing with each semiconductor process generation. A current exemplar is                                                                                                                                           |
| the recent Intel Core i7t microprocessor with four                                                                                                                                                                                              |
| processor cores, each of which is an out-of-order, multiple                                                                                                                                                                                     |
| instruction issue processor implementing the full X86                                                                                                                                                                                           |
| instruction set, supporting hyper- threading with two                                                                                                                                                                                           |
| hardware threads, designed to maximize the execu- tion                                                                                                                                                                                          |
| speed of sequential programs. In contrast, the                                                                                                                                                                                                  |
| many-thread trajectory focuses more on the execution                                                                                                                                                                                            |
| throughput of parallel applications. The many-threads                                                                                                                                                                                           |
| began with a large number of threads, and once again,                                                                                                                                                                                           |
| the number of threads increases with each generation. A                                                                                                                                                                                         |
| current exem- plar is the NVIDIA GTX680 graphics                                                                                                                                                                                                |
| processing unit (GPU) with 16,384 threads, executing in a                                                                                                                                                                                       |
| large number of simple, in-order pipelines.                                                                                                                                                                                                     |
| Many-threads processors, especially the GPUs, have led                                                                                                                                                                                          |
| the race of floating-point performance since 2003. As of                                                                                                                                                                                        |
| 2012, the ratio of peak floating-point calculation                                                                                                                                                                                              |
| throughput between many-thread GPUs and multicore                                                                                                                                                                                               |
| CPUs is about 10. These are not necessarily application                                                                                                                                                                                         |
| speeds, but are merely the raw speed that the execution                                                                                                                                                                                         |
| resources can potentially support in these chips: 1.5                                                                                                                                                                                           |
| teraflops versus 150 gigaflops double precision in 2012.                                                                                                                                                                                        |
|                                                                                                                                                                                                                                                 |
| 7 Draw the GPU architecture and explain . 4 1 1 2.2.3                                                                                                                                                                                           |
| 8 Discuss about the fundamental design factors of CPU and GPU. 4 1 1 1.3.1                                                                                                                                                                      |
| The design of a CPU is optimized for sequential code performance. It                                                                                                                                                                            |
| makes use of sophisticated control logic to allow instructions from a                                                                                                                                                                           |

single thread to execute in parallel or even out of their sequential order while maintaining the appearance of sequential execution. • More importantly, large cache memories are provided to reduce the instruction and data access latencies of large complex applications. •Neither control logic nor cache memories contribute to the peak calculation speed. As of 2012, the high-end general-purpose multicore microprocessors typically have six to eight large processor cores and multiple megabytes of on-chip. •Memory bandwidth is another important issue. The speed of many applications is limited by the rate at which data can be delivered from the memory system into the processors. Graphics chips have been operating at approximately six times the memory bandwidth of contemporaneously available CPU chips. •The design philosophy of GPUs is shaped by the fast-growing video game industry that exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations per video frame in advanced games. •This demand motivates GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations. •The prevailing solution is to optimize for the execution throughput of massive numbers of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency. • The reduced area and power of the memory access hardware and arithmetic units allows the designers to have more of them on a chip and thus increase the total execution throughput. •Small cache memories are provided to help control the bandwidth requirements of these applications so that multiple threads that access the same memory data do not need to all go to the DRAM. •This design style is commonly referred to as throughput-oriented design since it strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute. •The CPUs, on the other hand, are designed to minimize the execution latency of a single thread. •Large last-level on-chip caches are designed to capture frequently accessed data and convert some of the long-latency memory accesses into short-latency cache accesses.

|   | Part - C $(1x12 = 12 Marks)$                                                                                                                                                                                                                                                                                |    |   |   |   |       |
|---|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|---|---|---|-------|
|   | Instructions: Answer any 1                                                                                                                                                                                                                                                                                  |    |   |   |   |       |
| 9 | With a suitable drawing illustrate the design and decision factors involved in choosing the CPU and GPU for heterogeneous parallel computing.                                                                                                                                                               | 12 | 1 | 1 | 1 | 2.1.2 |
|   | The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread to execute in parallel or even out of their sequential order while maintaining the appearance of sequential execution.                             |    |   |   |   |       |
|   | More importantly, large cache memories are provided to reduce the instruction and data access latencies of large complex applications.                                                                                                                                                                      |    |   |   |   |       |
|   | •Neither control logic nor cache memories contribute to the peak calculation speed. As of 2012, the high-end general-purpose multicore microprocessors typically have six to eight large processor cores and multiple megabytes of on-chip.                                                                 |    |   |   |   |       |
|   | •Memory bandwidth is another important issue. The speed of many applications is limited by the rate at which data can be delivered from the memory system into the processors. Graphics chips have been operating at approximately six times the memory bandwidth of contemporaneously available CPU chips. |    |   |   |   |       |
|   | •The design philosophy of GPUs is shaped by the fast-growing video game industry that exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations per video frame in advanced games.                                                                     |    |   |   |   |       |
|   | •This demand motivates GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations.                                                                                                                                                                    |    |   |   |   |       |
|   | •The prevailing solution is to optimize for the execution throughput of massive numbers of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency.                                                                              |    |   |   |   |       |
|   | • The reduced area and power of the memory access hardware and arithmetic units allows the designers to have more of them on a chip and thus increase the total execution throughput.                                                                                                                       |    |   |   |   |       |
|   | •Small cache memories are provided to help control the bandwidth requirements of these applications so that multiple threads that access the same memory data do not need to all go to the DRAM.                                                                                                            |    |   |   |   |       |
|   | •This design style is commonly referred to as throughput-oriented design since it strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute.                                                      |    |   |   |   |       |

- •The CPUs, on the other hand, are designed to minimize the execution latency of a single thread.
- •Large last-level on-chip caches are designed to capture frequently accessed data and convert some of the long-latency memory accesses into short-latency cache accesses.
- •The arithmetic units and operand data delivery logic are also designed to minimize the effective latency of operation at the cost of increased use of chip area and power.
- •By reducing the latency of operations within the same thread, the CPU hardware reduces the execution latency of each individual thread. However, the large cache memory, low-latency arithmetic units, and sophisticated operand delivery logic consume chip area and power that could be otherwise used to provide more arithmetic execution units and memory access channels.

This design style is commonly referred to as latency-oriented design.

- •GPUs are designed as parallel, throughput oriented computing engines and they will not perform well on some tasks on which CPUs are designed to perform well.
- •For programs that have one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs.
- •When a program has a large number of threads, GPUs with higher execution throughput can achieve much higher performance than CPUs.
- •Therefore, one should expect that many applications use both CPUs and GPUs, executing the sequential parts on the CPU and numerically intensive parts on the GPUs.
- •This is why the CUDA programming model, introduced by NVIDIA in 2007, is designed to support joint CPU-GPU execution of an application.
- The demand for supporting joint CPUGPU execution is further reflected in more recent programming models such as OpenCL , OpenACC , and C++ AMP.

•

**Several other factors can** be even more important.

- First and foremost, the processors of choice must have a very large presence in the marketplace, referred to as the installed base of the processor.
- The reason is very simple. The cost of software development is best justified by a very large customer population.

|    | <ul> <li>Applications that run on a processor with a small market presence will not have a large customer base.</li> <li>This has been a major problem with traditional parallel computing systems that have negligible market presence compared to general-purpose microprocessors.</li> <li>Only a few elite applications funded by government and large corporations have been successfully developed on these traditional parallel computing systems.</li> </ul>                                                                                                                                                                                                                                                                     |    |   |   |   |       |
|----|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|---|---|---|-------|
|    | <ul> <li>Another important decision factor is practical form factors and easy accessibility.</li> <li>Until 2006, parallel software applications usually ran on data center servers or departmental clusters. But such execution environments tend to limit the use of these applications. For example, in an application such as medical imaging, it is fine to publish a paper based on a 64-node cluster machine.</li> <li>But actual clinical applications on magnetic resonance imaging (MRI) machines have been based on some combination of a PC and special hardware accelerators.</li> </ul> Control ALU ALU GPU  Cache  DRAM FIGURE 1.1 CPUs and GPUs have fundamentally different design philosophies                         |    |   |   |   |       |
| 10 | Draw the process of fixed function NVIDIA graphics pipeline operations in detail.  the leading performance graphics hardware was fixed-function pipelines that were configurable, but not pro- grammable. In that same era, major graphics Application Programming Interface (API) libraries became popular. An API is a standardized layer of software, that is, a collection of library functions that allows applica- tions (e.g., games) to use software or hardware services and functionality. For example, an API can allow a game to send commands to a graphics processing unit to draw objects on a display. One such API is DirectX, Microsoft's proprietary API for media functionality. The Direct3D compo- nent of DirectX | 12 | 1 | 1 | 1 | 2.1.2 |

provides interface functions to graphics processors. The other major API is OpenGL, an open-standard API supported by multiple vendors and popular in professional workstation applications. This era of fixed-function graphics pipeline roughly corresponds to the first seven generations of DirectX.

he host interface receives graphics commands and data from the CPU. The commands are typically given by application programs by calling an API function. The host interface typically contains a specialized DMA hardware to efficiently transfer bulk data to and from the host system memory to the graphics pipeline. The host interface also communicates back the status and result data of executing the commands.

Before we describe the other stages of the pipeline, we should clarify that the term vertex usually means the "corners" of a polygon. The

GeForce graphics pipeline is designed to render triangles, so vertex is typi- cally used to refer to the corners of a triangle. The surface of an object is drawn as a collection of triangles. The finer the sizes of the triangles are, the better the quality of the picture typically becomes. The vertex control stage in Figure 2.1 receives parameterized triangle data from the CPU. The vertex control stage converts the triangle data into a form that the hardware understands and places the prepared data into the vertex cache.

The vertex shading, transform, and lighting (VS/T&L) stage in Figure 2.1 transforms vertices and assigns per-vertex values (colors, nor- mals, texture coordinates, tangents, etc.). The shading is done by the pixel shader hardware. The vertex shader can assign a color to each vertex but it is not applied to triangle pixels until later. The triangle setup stage further creates edge equations that are used to interpolate colors and other per- vertex data (e.g., texture coordinates) across the pixels touched by the tri- angle. The raster stage determines which pixels are contained in each tri- angle. For each of these pixels, the raster stage interpolates per-vertex values necessary for shading the pixel, which includes color, position, and texture position that will be shaded (painted) on the pixel.

The shader stage in Figure 2.1 determines the final color of each pixel. This can be generated as a combined effect of many techniques: interpolation of vertex colors, texture mapping, per-pixel lighting mathematics, reflections, and more. Many effects that make the rendered images more realistic are incorporated in the shader stage



<sup>\*</sup>Performance Indicators are available separately for Computer Science and Engineering in AICTE examination reforms policy.

## Course Outcome (CO) and Bloom's level (BL) Coverage in Questions



**Approved by the Audit Professor/Course Coordinator**